Computational and Linguistic Issues in Designing a Syntactically Annotated Parallel Corpus of Indo-European Languages

Authors

  • Dag T. Haug
  • Marius L. Jøhndal
  • Hanne M. Eckhoff
  • Eirik Welo
  • Mari J. B. Hertzenberg
  • Angelika Müth

Abstract

This paper reports on the development of the PROIEL parallel corpus of New Testament texts, which contains the Greek original of the New Testament and its earliest Indo-European translations, into Latin, Gothic, Old Church Slavic and Classical Armenian. A web application has been constructed specifically for the purpose of annotating the texts at multiple levels: morphology, syntax, alignment at sentence, dependency graph and token level, information structure and semantics. We describe this web application and our annotation schemes. Although designed for investigating pragmatic resources, the corpus with its rich annotation is an important resource in contrastive and historical Indo-European syntax and pragmatics, easily expandable to include other old Indo-European languages.

RÉSUMÉ. L’article décrit le développement du corpus aligné PROIEL, qui couvre le texte original grec du Nouveau Testament et les traductions latine, gotique, vieux-slave et arménienne. Pour faciliter la création du corpus, nous avons développé une application web qui permet l’annotation des textes sur plusieurs niveaux : morphologie, syntaxe, alignement de phrases, syntagmes et mots, structure informationnelle et sémantique. Dans l’article nous décrivons cette application web ainsi que nos schémas d’annotation. Bien que conçue pour l’étude des phénomènes pragmatiques, l’annotation très riche des textes a abouti à une ressource importante pour l’étude comparée et historique de la syntaxe et de la pragmatique indo-européennes, et le corpus pourra facilement être étendu à d’autres langues indo-européennes.

Similar resources

Morphologically and Syntactically Annotated Corpora of Many Languages

Annotated corpora have become a standard resource for research in both linguistics and computational processing of natural languages. Lexicographers judge word usage and distribution by occurrences in corpora; part-of-speech tags may help them narrow their queries. Grammarians may use syntactically annotated corpora (treebanks) for queries such as “show me all examples where a verb governs two ...

Building a Parallel Bilingual Syntactically Annotated Corpus

This paper describes a process of building a bilingual syntactically annotated corpus, the PCEDT (Prague Czech-English Dependency Treebank). The corpus is being created at Charles University, Prague, and the release of this corpus as Linguistic Data Consortium data collection is scheduled for the spring of 2004. The paper discusses important decisions made prior to the start of the project and ...

The English-Swedish-Turkish Parallel Treebank

We describe a syntactically annotated parallel corpus containing typologically partly different languages, namely English, Swedish and Turkish. The corpus consists of approximately 300 000 tokens in Swedish, 160 000 in Turkish and 150 000 in English, containing both fiction and technical documents. We build the corpus by using the Uplug toolkit for automatic structural markup, such as tokenizat...

Coreferential Relations in Basque: The Annotation Process.

In this paper we present the coreferential tagging of part of the EPEC Corpus of Basque. Although coreference is a pragmatic linguistic phenomenon highly dependent on the situational context, it shows some language-specific patterns that vary according to the features of each language. Due to the fact that Basque is not an Indo-European language, it differs considerably in grammar from the lang...

Tan Liling and Francis Bond. Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)

The NTU-MC compilation taps on the linguistic diversity of multilingual texts available within Singapore. The current version of NTU-MC contains 375,000 words (15,000 sentences) in 6 languages (English, Chinese, Japanese, Korean, Indonesian and Vietnamese) from 6 language families (Indo-European, Sino-Tibetan, Japonic, Korean as a language isolate, Austronesian and Austro-Asiatic). The NTU-MC i...

Journal title:
  • TAL

Volume 50  Issue 

Pages  -

Publication date 2009